Evaluating the use of blood pressure polygenic risk scores across race/ethnic background groups

We assess the performance and limitations of polygenic risk scores (PRSs) for multiple blood pressure (BP) phenotypes in diverse population groups. We compare “clumping-and-thresholding” (PRSice2) and LD-based (LDPred2) methods to construct PRSs from each of multiple GWAS, as well as multi-PRS approaches that sum PRSs with and without weights, including PRS-CSx. We use datasets from the MGB Biobank, the TOPMed study, the UK Biobank, and All of Us to train, assess, and validate PRSs in groups defined by self-reported race/ethnic background (Asian, Black, Hispanic/Latino, and White). For both SBP and DBP, the PRS-CSx-based PRS, constructed as a weighted sum of PRSs developed from multiple independent GWAS, performs best across all race/ethnic backgrounds. Stratified analysis in All of Us shows that PRSs are more predictive of BP in females than in males, in individuals without obesity, and in middle-aged individuals (40-60 years) than in older and younger individuals.


The UK Biobank dataset
The UK Biobank (UKBB) cohort consists of 502,620 participants recruited in the United Kingdom and is described in detail elsewhere 1. Participants answered questionnaires to assess medical conditions, lifestyle, and demographic information. Interviews were conducted by trained medical staff to assess medical history, health status, and medication intake. At the time of the initial interview, participants had a medical exam that included the SBP and DBP measurements used in this study. The National Health Service National Research Ethics Service (ref. 11/NW/0382) gave approval for the study. As is accepted practice in genetic studies of BP phenotypes, SBP and DBP values were raised by 15 mmHg and 10 mmHg, respectively, in individuals using antihypertensive medications.

Identifying individuals of self-identified Black identity, and of African Ancestry
We were interested in assessing performance of BP PRS in individuals of Black identity in the UKBB cohort. We used self-reported race/ethnic background (UKBB Data-Field 21000) to select 8,646 individuals who self-reported as "Black or Black British" or "White and Black Caribbean" or "White and Black African" ethnicity, which were the study-defined ethnic identities that referenced "Black" or "African" identity.
We also identified a subset of individuals with predominately African genetic ancestry, defined as a proportion of continental African ancestry ≥ 0.8. We used these individuals for a secondary analysis of PRS performance in ancestry-defined groups, and for a secondary analysis of scaling + matching of PRS distributions between datasets. To identify such individuals, we performed an unsupervised analysis of ancestry proportions: we followed Constantinescu et al. 2 in analyzing all UKBB non-European individuals (i.e., we excluded from the UKBB dataset all individuals who self-reported European or White ethnicity). We cleaned and pruned the data by removing genotypes with ≥ 1% missingness, followed by LD pruning using PLINK with the settings --indep-pairwise 50 10 0.1, i.e., removing SNPs with LD>0.1 within 50Kb distances of each other, and with step size variant count =10. Next, we applied the SCOPE software 3 to an unrelated set of individuals to compute admixture proportions using unsupervised analysis with k=4 ancestries. The number of ancestries was selected based on the number of super populations in the results reported by Constantinescu et al. 2 We identified the component corresponding to African ancestry as the one with high proportions in individuals with Black identity. Out of the Black individuals, 5,816 were selected as having predominately African ancestry.

Genotype data and imputation
In total, 488,282 UKBB participants were genotyped on two closely related genotyping arrays: UK BiLEVE (N = 49,939) and UK Biobank Axiom (N = 438,343). Genotypes were imputed using reference sequence data as previously described 4.

PRS calculations
The PRS models, consisting of selected genetic markers and associated weights, calculated in the primary analysis using PRSice 2 software 5, were ported to positions and alleles matching the UKBB imputed genotypes (build hg19). Then, we applied PRSice without additional clumping (i.e., using the clumping from the TOPMed BP cohort) to the selected markers at the 5×10⁻⁸, 1×10⁻⁷, 1×10⁻⁵, and 1×10⁻² p-value thresholds. In primary analysis, PRSs in the UKBB-Black dataset were standardized by subtracting the mean PRS value and dividing by the standard deviation, both computed on the TOPMed-BP dataset. In secondary analyses we also performed scaling + matching of PRS across platforms as described later. PRS based on summations, association analyses, and performance analysis of PRS were calculated as described in the main manuscript.
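The primary scaling step, in which a PRS in one dataset is standardized using the mean and standard deviation from a reference dataset, can be illustrated with a short sketch. The function name is hypothetical, and the use of the sample standard deviation (n - 1 denominator) is an assumption; the text does not state which estimator was used.

```python
from statistics import mean, stdev

def standardize_with_reference(target_prs, reference_prs):
    """Scale target PRS values using the mean and SD of a reference dataset
    (TOPMed-BP plays the reference role in the text)."""
    mu = mean(reference_prs)
    sd = stdev(reference_prs)  # assumption: sample SD with n - 1 denominator
    return [(x - mu) / sd for x in target_prs]

# Toy values: the reference has mean 2.0 and sample SD 1.0.
print(standardize_with_reference([2.0, 4.0], [1.0, 2.0, 3.0]))  # [0.0, 2.0]
```

Because the reference moments are reused, the scaled PRS in the target dataset will have mean 0 and variance 1 only if its distribution matches that of the reference.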

Mass General Brigham Biobank methods
Samples, genomic data, and health information were obtained from the Mass General Brigham Biobank, a biorepository of consented patient samples at Mass General Brigham. Phenotypic data was extracted from the MGB Biobank on February 1, 2023.

DNA samples
DNA samples are processed from whole blood that was collected as a dedicated research draw or as a clinical discard. The aim is to process dedicated research samples within four hours of collection. Clinical discards are processed 24+ hours after collection. Whole blood is spun to buffy coat with a centrifuge, and the buffy coat is stored in a freezer for up to several months. DNA is then extracted from the buffy coat and placed in an ultralow freezer (-80ºC). Each DNA aliquot contains a minimum of 2 µg of DNA. The concentration varies.

Genotyping
Samples have been genotyped using three versions of the biobank SNP array offered by Illumina that is designed to capture the diversity of genetic backgrounds across the globe. The first batch of data was generated on the Multi-Ethnic Genotyping Array (MEGA) array, the first release of this SNP array. The second, third, and fourth batches were generated on the Expanded Multi-Ethnic Genotyping Array (MEGA Ex) array. All remaining data were generated on the Multi-Ethnic Global (MEG) BeadChip.

Imputation
Prior to performing imputation, files were converted to VCF format, separated by chromosomes.
When multiple probes measured the same genotypes, they were checked for concordance and were set to a missing value if the genotypes did not match. Files were uploaded to the Michigan Imputation Server, and genotypes were imputed using the TOPMed reference panel. Genomic coordinates are provided in GRCh38.
We computed principal components (PCs) using PLINK: we pruned the genotype data using a window size of 1000 variants, sliding across the genome with a step size of 250 variants at a time, filtering out any SNPs with LD r² > 0.1. We used unrelated individuals (3rd degree, identified using PLINK) to compute the loadings for the first 10 PCs.

PRS construction
We constructed all PRS using PRSice 2, using the same SNPs as those selected in the various developed PRS, i.e. without further clumping. In primary analysis, we scaled the PRSs by subtracting the mean and dividing by SD computed on the TOPMed-BP dataset. In secondary analysis, we implemented additional scaling + matching of PRS across datasets as described later.

Identification of individuals with predominant European ancestry
To apply scaling + matching approaches of PRS across datasets (TOPMed, MGB Biobank, and UKBB), we identified individuals of predominately European ancestry, defined as having a proportion of at least 0.8 of continental European ancestry. We used ADMIXTURE software 6 to compute proportions of genetic ancestry in the MGB Biobank. First, we prepared the genetic data: we removed genotypes with ≥ 1% missingness, followed by LD pruning using PLINK with the settings --indep-pairwise 50 10 0.1, i.e., removing SNPs with LD>0.1 within 50Kb distances of each other, and with step size variant count =10. Next, we applied ADMIXTURE to an unrelated set of individuals in an unsupervised analysis, with k=4. We identified the component corresponding to European ancestry as the component with higher proportions in individuals self-reported as White. Out of 33,855 individuals self-reported as White, 20,936 had predominately European ancestry.

Ethics statement
All Biobank subjects have provided their consent to join the Partners Biobank, which includes agreeing to provide a blood sample linked to the electronic medical record. Subjects also agree to be recontacted by the Partners Biobank staff as needed.

Acknowledgements
We thank Mass General Brigham Biobank for providing samples, genomic data, and health information data.

Global genetic ancestry inference in TOPMed
Ancestry inference was performed by the TOPMed Informatics Research Center (IRC). First, local ancestry was inferred using RFMix 7, with default parameter settings except the following option: --node-size=5. Then, global ancestry was computed for each participant as a weighted average of the ancestries in inferred local ancestry intervals. The reference panel used was the Human Genome Diversity Panel (HGDP) downloaded from the Stanford HGDP website http://hagsc.org/hgdp/files.html. Genomic coordinates were lifted over from genome build 37 to build 38. The 53 HGDP populations were merged into 7 super-populations: Sub-Saharan Africa, Central and South Asia, East Asia, Europe, Native America, Oceania, and Middle East. Local ancestry inference was performed in two versions. First, for samples available in TOPMed freeze 6, RFMix V1 was used, and local ancestry was inferred for the autosomes only. Later, for samples participating only in freeze 8 (but not in freeze 6), and for the X-chromosome, local ancestry inference was performed using RFMix V2.
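As an illustration of how global ancestry can be derived from local ancestry calls, the sketch below computes a weighted average over intervals. The weighting by interval length is an assumption (the text does not specify the weights), and the function name and input layout are hypothetical.

```python
from collections import defaultdict

def global_ancestry(intervals):
    """Compute global ancestry proportions from local ancestry intervals.

    intervals : iterable of (length, ancestry_label) pairs, one per inferred
                local ancestry interval on a haplotype.
    Assumption: each interval is weighted by its length.
    """
    totals = defaultdict(float)
    for length, ancestry in intervals:
        totals[ancestry] += length
    covered = sum(totals.values())
    if covered == 0:
        raise ValueError("no local ancestry intervals with positive length")
    return {ancestry: total / covered for ancestry, total in totals.items()}

# Toy intervals (lengths in arbitrary units, labels from the 7 super-populations).
calls = [(30, "Europe"), (50, "Sub-Saharan Africa"), (20, "Europe")]
print(global_ancestry(calls))  # {'Europe': 0.5, 'Sub-Saharan Africa': 0.5}
```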

Secondary analysis of scaling and scaling + matching to address PRS distribution differences across datasets
We considered a few approaches for scaling and scaling + matching PRS. Here scaling + matching refers to explicitly scaling PRSs in the various datasets so that the PRS distributions agree across datasets according to some objective criterion. Our primary scaling approach implicitly assumes that the distributions of PRSs match between datasets: we computed the means and SDs of PRSs in TOPMed-BP and used the same means and SDs to scale the corresponding PRSs in all datasets (TOPMed-BP, MGB Biobank, and UKBB). In secondary analysis, we also scaled PRS independently in each dataset using dataset-specific means and SDs. We attempted two scaling + matching approaches: matching PRS distributions across datasets in groups defined by (a) genetic ancestry, and (b) self-reported race/ethnicity. In each of these scaling + matching instances, we identified groups of individuals who are similar in either their genetic ancestry (a) or their self-reported race/ethnicity (b) and computed the means and SDs of PRS in these groups.
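Then, we used these means and SDs to scale PRSs within the respective dataset. In mathematical notation, let $x_i^g$ be the value of a given PRS in individual $i$ from group $g$. In a given dataset, let the mean and SD of the PRS computed over all $n_g$ individuals from group $g$ be defined as:
$$\mu_g = \frac{1}{n_g} \sum_{i \in g} x_i^g, \qquad \sigma_g = \sqrt{\frac{1}{n_g - 1} \sum_{i \in g} \left(x_i^g - \mu_g\right)^2}.$$
We next scale all individuals within the dataset, regardless of their specific group, using $\mu_g$ and $\sigma_g$: for an individual $j$ with PRS value $x_j$, the scaled value is $z_j = (x_j - \mu_g)/\sigma_g$. Importantly, for individuals from group $g$, after applying the scaling transformation the scaled PRS has mean 0 and SD 1 in each dataset.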
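A minimal sketch of the scaling + matching idea follows: the mean and SD are estimated only in the matching group, and the resulting transformation is applied to every individual in the dataset. The function name and the boolean membership input are hypothetical, and the sample SD (n - 1 denominator) is an assumption.

```python
from statistics import mean, stdev

def scale_by_matching_group(prs, in_group):
    """Scale all PRS values in a dataset using the mean and SD estimated
    in one matching group (defined by ancestry or self-report).

    prs      : list of PRS values, one per individual in the dataset
    in_group : list of booleans, True if the individual is in the matching group
    """
    group_values = [x for x, member in zip(prs, in_group) if member]
    mu_g = mean(group_values)
    sd_g = stdev(group_values)  # assumption: sample SD with n - 1 denominator
    # Everyone is scaled, regardless of group membership.
    return [(x - mu_g) / sd_g for x in prs]

# Toy dataset: three group members (mean 2.0, SD 1.0) and one non-member.
scaled = scale_by_matching_group([1.0, 2.0, 3.0, 10.0], [True, True, True, False])
print(scaled)  # [-1.0, 0.0, 1.0, 8.0]
```

Applying the same function separately to each dataset gives the matching group mean 0 and SD 1 in every dataset, while individuals outside the group are shifted and scaled by the same amounts.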

All of Us Methods
At the time of analysis, there were 98,590 WGS samples available from All of Us. A report of sequencing and quality control methods is provided at https://www.researchallofus.org/wp-content/themes/research-hub-wordpresstheme/media/2022/06/All%20Of%20Us%20Q2%202022%20Release%20Genomic%20Quality%20Report.pdf . All of Us provided PCs and performed relatedness analysis. In this work we removed first- and second-degree relatives, using the pre-computed pairs provided by All of Us, to generate a set of individuals in which any pair is related at the third degree or more distantly. The data were accessed on September 13, 2022.
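One way to derive an unrelated set from a list of pre-computed related pairs is a greedy removal, sketched below. The text does not describe the removal strategy that was actually used, so the greedy rule (drop the individual involved in the most remaining pairs, ties broken by identifier) and the function name are assumptions made for illustration.

```python
from collections import Counter

def unrelated_subset(sample_ids, related_pairs):
    """Return a set of ids in which no listed related pair remains.

    sample_ids    : iterable of individual ids
    related_pairs : iterable of (id1, id2) tuples flagged as close relatives
    """
    remaining = {frozenset(pair) for pair in related_pairs if pair[0] != pair[1]}
    kept = set(sample_ids)
    while remaining:
        # Count how many unresolved pairs each individual participates in.
        counts = Counter(i for pair in remaining for i in pair)
        # Greedy choice: most pairs first, then smallest id for determinism.
        worst = min(counts, key=lambda i: (-counts[i], i))
        kept.discard(worst)
        remaining = {pair for pair in remaining if worst not in pair}
    return kept

pairs = [("A", "B"), ("B", "C"), ("D", "E")]
print(sorted(unrelated_subset(["A", "B", "C", "D", "E"], pairs)))  # ['A', 'C', 'E']
```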

Genetic ancestry in All of Us
The All of Us study team provided ancestry labels; however, the computation of ancestry in All of Us differed from that in TOPMed, UKBB, and MGB Biobank. Proportions of global ancestries were not computed; rather, an ancestry label was assigned to each participant according to their "location" in the PC space, guided by self-reported labels from the survey.
Accordingly, in All of Us one of the ancestries is "AMR" referring to Latino/Admixed American, whereas in TOPMed we refer instead to the parent ancestries of Admixed Hispanics/Latinos as European, African, and Amerindian. To clarify this difference we refer to "ancestry" groups in All of Us as groups defined by a combination of self-reported race/ethnicity and genetic similarity.

BP phenotypes in All of Us
We followed a previously developed hypertension analysis in All of Us using a Jupyter notebook that was made available to All of Us researchers via a workspace called "Demo - Hypertension Prevalence". In brief, we extracted measured SBP and DBP physical exam data for all participants who have WGS data. Afterwards, the earliest measurements were selected for use.
We used age at the time of the selected BP measurement, and extracted BMI and antihypertensive medication use from either the electronic health record (EHR) or the physical exam data, taking the record closest in date to the BP measurement. Following the analysis in the shared Jupyter notebook, antihypertensive medications were defined as peripheral vasodilators, agents acting on the renin-angiotensin system, beta blocking agents, antihypertensives, calcium channel blockers, and diuretics. SBP and DBP values were raised by 15 mmHg and 10 mmHg, respectively, in individuals whose first date of antihypertensive medication use was before or in the same year as their BP measurement. Association analyses were adjusted for age, sex at birth, BMI, and the first 10 PCs of genetic data.
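The phenotype rules above (earliest measurement, then a medication adjustment keyed on the year of first medication use) can be sketched as follows. The function names and the tuple layout are hypothetical; this is not the code from the shared notebook.

```python
from datetime import date

def earliest_measurement(measurements):
    """measurements: list of (measurement_date, sbp, dbp) tuples."""
    return min(measurements, key=lambda m: m[0])

def adjusted_bp(measurements, first_medication_date=None):
    """Return (SBP, DBP) from the earliest measurement, adding 15 and 10 mmHg
    when antihypertensive use began before or in the same year."""
    measured_on, sbp, dbp = earliest_measurement(measurements)
    medicated = (
        first_medication_date is not None
        and first_medication_date.year <= measured_on.year
    )
    if medicated:
        return sbp + 15, dbp + 10
    return sbp, dbp

# Toy records: the 2018 visit is the earliest and medication began in 2018.
visits = [(date(2019, 5, 1), 140, 90), (date(2018, 3, 2), 128, 82)]
print(adjusted_bp(visits, date(2018, 11, 20)))  # (143, 92)
print(adjusted_bp(visits, date(2019, 1, 5)))    # (128, 82)
```

Note that the comparison is made on calendar years, so a first prescription later in the same year as the measurement still triggers the adjustment, as stated in the text.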

PRS construction
We extracted HapMap SNPs from the Hail table of the All of Us WGS data. We filtered out SNPs with call rate lower than 99%, SNPs with MAF<1% in the All of Us dataset, SNPs that failed QC filters, and SNPs with missing genotypes in ≥1% of the analytic sample. We then converted the resulting Hail table into a PLINK file. Next, we constructed the PRS-CSx2 ancestry-specific PRS using PLINK v1.9, standardized them using TOPMed means and SDs, and applied PRS summation weights computed using the MGB dataset to combine the PRSs. We then standardized the combined PRS again using TOPMed means and SDs.
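The combination step (standardize each ancestry-specific PRS, take a weighted sum, then standardize the sum) can be illustrated for a single individual. All names, labels such as "EUR" and "AFR", and numeric values below are hypothetical placeholders; the actual means, SDs, and weights are those referenced in the supplementary tables.

```python
def combine_prs(ancestry_prs, weights, component_stats, combined_stats):
    """Weighted sum of standardized ancestry-specific PRSs, re-standardized.

    ancestry_prs    : dict label -> raw PRS value for one individual
    weights         : dict label -> summation weight
    component_stats : dict label -> (mean, sd) from the reference dataset
    combined_stats  : (mean, sd) of the weighted sum in the reference dataset
    """
    weighted_sum = sum(
        weights[label]
        * (ancestry_prs[label] - component_stats[label][0])
        / component_stats[label][1]
        for label in weights
    )
    mu, sd = combined_stats
    return (weighted_sum - mu) / sd

# Toy inputs: the standardized components are 1.0 and 0.0, so the weighted
# sum is 0.5, and re-standardizing with (mean 0.0, SD 0.5) gives 1.0.
score = combine_prs(
    ancestry_prs={"EUR": 3.0, "AFR": 1.0},
    weights={"EUR": 0.5, "AFR": 0.25},
    component_stats={"EUR": (1.0, 2.0), "AFR": (1.0, 0.5)},
    combined_stats=(0.0, 0.5),
)
print(score)  # 1.0
```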

Clinical outcomes
We computed the association of PRS with multiple adverse clinical outcomes, including hypertension and other outcomes known to be associated with hypertension: type 2 diabetes, chronic kidney disease, coronary artery disease, atrial fibrillation, and heart failure. Hypertension was defined according to the AHA guidelines as SBP ≥ 130 mmHg, DBP ≥ 80 mmHg, or use of antihypertensive medications. For this definition, we used SBP, DBP, and medication information from the main analysis. For the other clinical outcomes, we used the All of Us data browser to identify the top medical condition in terms of patient number and its corresponding SNOMEDCode and OMOP Concept ID as mapped by the All of Us dataset (note that the same standard concepts were mapped to these SNOMEDCodes and OMOP Concept IDs). The codes and standard concept names are provided in Supplementary Table 9.
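The hypertension definition stated above reduces to a simple rule, sketched here with a hypothetical function name. Because medication use alone is sufficient for case status, the result does not depend on whether the medication-adjusted or the measured BP values are passed in.

```python
def has_hypertension(sbp, dbp, on_antihypertensives):
    """Case status per the definition in the text: SBP >= 130 mmHg,
    DBP >= 80 mmHg, or use of antihypertensive medications."""
    return bool(sbp >= 130 or dbp >= 80 or on_antihypertensives)

print(has_hypertension(128, 82, False))  # True  (DBP criterion)
print(has_hypertension(120, 75, False))  # False
print(has_hypertension(118, 70, True))   # True  (medication criterion)
```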

Ethics statement
The All of Us research program was approved by a single IRB, the "All of Us IRB", which is charged with reviewing the protocol, informed consent, and other participant-facing materials for the All of Us Research Program. The IRB follows the regulations and guidance of the Office for Human Research Protections for all studies, ensuring that the rights and welfare of research participants are overseen and protected uniformly. More information is provided online https://allofus.nih.gov/about/who-we-are/institutional-review-board-irb-of-all-of-us-researchprogram and in the All of Us design paper 8 .

Acknowledgements
The All of Us Research Program is supported by the National Institutes of Health, Office of the Director: Regional

Supplementary Tables
MGB participant characteristics are based on a database query from February 1, 2023. Age is at the time of the database query. SBP and DBP values were the medians in the health records. Obesity status was determined based on a chart-validated algorithm 9 using BMI values and obesity and diabetes diagnosis codes, with the threshold set to achieve a positive predictive value of 0.90. Hypertension was determined based on a chart-validated algorithm using prescription and diagnosis codes related to hypertension, with the threshold set to achieve a positive predictive value of 0.95. Hypertension medication refers to a history of having any hypertension medication as described in the main manuscript.
For each ancestry-specific PRS, the table provides its mean and SD in the multi-ethnic TOPMed-BP dataset. All PRS were standardized using TOPMed-BP means and SDs.
The means and SDs are provided for the combined PRS: after combining standardized ancestry-specific PRS (standardized based on means and SDs in Supplementary Table 6, and weighted using weights in Supplementary Table 8), primary analysis used background-specific weights. Therefore, PRS were standardized in each background group separately.
For each outcome, standard concept names were pre-grouped by the All of Us database via the corresponding SNOMEDCode and OMOP Concept ID provided.

Supplementary Figures
Supplementary Figure 1: Estimated phenotypic and residual variances of SBP by study and diversity background.
Phenotypic variances were estimated based on the raw phenotypes. Residual variances were estimated after regressing SBP on covariates (age, age², BMI, sex, smoking status, and 11 PCs). We used unrelated individuals for these computations. Abbreviations and definitions. BMI: body mass index; PC: principal component; SBP: systolic blood pressure.
Supplementary Figure 2: Estimated phenotypic and residual variances of DBP by study and diversity background.
Phenotypic variances were estimated based on the raw phenotypes. Residual variances were estimated after regressing DBP on covariates (age, age², BMI, sex, smoking status, study site, and 11 PCs). We used unrelated individuals for these computations. Abbreviations and definitions. BMI: body mass index; DBP: diastolic blood pressure; PC: principal component.
The height of each bar represents estimated PVE, and intervals represent the 95% confidence intervals based on the 2.5% and 97.5% distribution percentiles from bootstrap performed using unrelated individuals. SBP PRS PVEs were computed in TOPMed-BP and UKBB Black individuals, where PRS summation weights were trained using biobank data. Different PRS scaling approaches across datasets were taken: TOPMed scaling (PRS are scaled using means and standard deviations estimated on the TOPMed-BP dataset); scaling + matching using groups defined by genetic ancestry (European ancestry when matching MGB Biobank to TOPMed, and African ancestry when matching UKBB Black individuals to TOPMed); scaling + matching using groups defined by self-reported race/ethnicity (White or Black for MGB Biobank and UKBB, respectively); and dataset-specific, independent scaling (PRS in each dataset are independently scaled to have mean 0 and variance 1 in the dataset).

Supplementary Figure 25: Association of BP and hypertension PRS with prevalent clinical outcomes in the All of Us dataset among antihypertensive medication users
The figure visualizes the estimated associations (odds ratios; ORs; provided numerically and visualized as points in the forest plot) and 95% confidence intervals (CIs; provided numerically and visualized as the intervals in the forest plot) of the best performing SBP and DBP PRSs as selected in the TOPMed dataset (PRS-CSx2), their simple sum, and the previously-developed HTN-PRS with prevalent outcomes in All of Us individuals using antihypertensives. The figure also provides the numbers of cases and controls for each outcome, association p-values computed based on a two-sided

All study protocols were approved by the institutional review board at the University of Maryland Baltimore. Informed consent was obtained from each study participant.

Amish acknowledgements:
We gratefully acknowledge our Amish liaisons, research volunteers, field workers, and Amish

The ARIC study has been described in detail previously 10,11.

Ethics statement:
The ARIC study has been approved by Institutional Review Boards (IRB) at all participating institutions: University of North Carolina at Chapel Hill IRB, Johns Hopkins University IRB, University of Minnesota IRB, and University of Mississippi Medical Center IRB. Study participants provided written informed consent at all study visits.

ARIC acknowledgements:
The Atherosclerosis Risk in Communities study has been funded in whole or in part with Federal funds from the National Heart, Lung, and Blood Institute, National Institutes of Health, Department of Health and Human Services (contract numbers HHSN268201700001I, HHSN268201700002I, HHSN268201700003I, HHSN268201700004I and HHSN268201700005I).
The authors thank the staff and participants of the ARIC study for their important contributions.

BioMe
The

Ethics statement:
The BioMe cohort was approved by the Institutional Review Board at the Icahn School of Medicine at Mount Sinai. All BioMe participants provided written, informed consent for genomic data sharing.

BioMe acknowledgements:
The and 71%, respectively) and written informed consent was obtained in each visit.

Ethics statement:
All CARDIA participants provided informed consent, and the study was approved by the Institutional Review Boards of the University of Alabama at Birmingham and the University of Texas Health Science Center at Houston.

CARDIA acknowledgements:
The

The Framingham Heart Study was approved by the Institutional Review Board of the Boston University Medical Center. All study participants provided written informed consent.

FHS acknowledgements:
The  follow-up of the WHI-CT and WHI-OS cohorts has continued for over 25 years, with the accumulation of large numbers of diverse clinical outcomes, risk factor measurements, medication use, and many other types of data.

Ethics statement:
All WHI participants provided informed consent and the study was approved by the Institutional Review Board (IRB) of the Fred Hutchinson Cancer Research Center.

WHI acknowledgements:
The WHI program is funded by the National Heart, Lung, and Blood Institute, National Institutes

Ethics statements:
All THRV participants provided informed consent, and the study was approved by the